Model Based Clustering for Mixed Data: clustMD
A model based clustering procedure for data of mixed type, clustMD, is
developed using a latent variable model. It is proposed that a latent variable,
following a mixture of Gaussian distributions, generates the observed data of
mixed type. The observed data may be any combination of continuous, binary,
ordinal or nominal variables. clustMD employs a parsimonious covariance
structure for the latent variables, leading to a suite of six clustering models
that vary in complexity and provide an elegant and unified approach to
clustering mixed data. An expectation maximisation (EM) algorithm is used to
estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is
required. The clustMD model is illustrated by clustering simulated mixed type
data and prostate cancer patients, on whom mixed data have been recorded.
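The latent variable idea in the abstract above can be sketched for a single ordinal variable. This is a minimal illustration, not the clustMD implementation: it assumes a univariate Gaussian latent variable per cluster and fixed thresholds that cut the latent scale into ordinal levels. The function names, thresholds and parameter values are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ordinal_level_probabilities(mean, sd, thresholds):
    """Probability of each ordinal level when a latent N(mean, sd^2)
    variable is cut at the given increasing thresholds (assumed fixed)."""
    cuts = [-math.inf] + list(thresholds) + [math.inf]
    cdf = [normal_cdf((c - mean) / sd) for c in cuts]
    # Level k is observed when the latent value falls between cuts k and k+1.
    return [upper - lower for lower, upper in zip(cdf, cdf[1:])]


def mixture_level_probabilities(weights, means, sds, thresholds):
    """Marginal level probabilities under a Gaussian mixture on the latent scale."""
    per_cluster = [
        ordinal_level_probabilities(m, s, thresholds) for m, s in zip(means, sds)
    ]
    n_levels = len(thresholds) + 1
    return [
        sum(w * probs[k] for w, probs in zip(weights, per_cluster))
        for k in range(n_levels)
    ]


# Two hypothetical clusters observed through a three-level ordinal variable.
marginal = mixture_level_probabilities(
    weights=[0.4, 0.6], means=[-1.0, 1.5], sds=[1.0, 1.0], thresholds=[-0.5, 0.5]
)
```

Clusters with different latent means place their mass on different ordinal levels, which is what lets a mixture on the latent scale separate groups in the observed mixed-type data.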
A mixture of experts model for rank data with applications in election studies
A voting bloc is defined to be a group of voters who have similar voting
preferences. The cleavage of the Irish electorate into voting blocs is of
interest. Irish elections employ a ``single transferable vote'' electoral
system; under this system voters rank some or all of the electoral candidates
in order of preference. These rank votes provide a rich source of preference
information from which inferences about the composition of the electorate may
be drawn. Additionally, the influence of social factors or covariates on the
electorate composition is of interest. A mixture of experts model is a mixture
model in which the model parameters are functions of covariates. A mixture of
experts model for rank data is developed to provide a model-based method to
cluster Irish voters into voting blocs, to examine the influence of social
factors on this clustering and to examine the characteristic preferences of the
voting blocs. The Benter model for rank data is employed as the family of
component densities within the mixture of experts model; generalized linear
model theory is employed to model the influence of covariates on the mixing
proportions. Model fitting is achieved via a hybrid of the EM and MM
algorithms. An example of the methodology is illustrated by examining an Irish
presidential election. The existence of voting blocs in the electorate is
established and it is determined that age and government satisfaction levels
are important factors in influencing voting in this election.
Comment: Published at http://dx.doi.org/10.1214/08-AOAS178 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
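The structure described in the abstract above (Benter component densities with covariate-dependent mixing proportions) can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes the usual Benter form, in which the candidate chosen at preference level t has probability proportional to its support parameter raised to a dampening power alpha_t, and a multinomial-logit gating function. Sharing one dampening vector across components is also an assumption here, and all names and numbers are hypothetical.

```python
import math


def benter_probability(ranking, support, dampening):
    """Probability of a (possibly partial) ranking under a Benter-style model.

    ranking   : candidate indices in order of preference
    support   : positive support parameter for each candidate
    dampening : one exponent per preference level (all ones gives Plackett-Luce)
    """
    remaining = set(range(len(support)))
    prob = 1.0
    for level, candidate in enumerate(ranking):
        alpha = dampening[level]
        numerator = support[candidate] ** alpha
        # Normalise over every candidate not yet ranked.
        denominator = sum(support[c] ** alpha for c in remaining)
        prob *= numerator / denominator
        remaining.remove(candidate)
    return prob


def gating_proportions(covariates, coefficients):
    """Mixing proportions as a multinomial-logit function of voter covariates."""
    scores = [sum(b * x for b, x in zip(beta, covariates)) for beta in coefficients]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def mixture_probability(ranking, covariates, coefficients, supports, dampening):
    """Mixture-of-experts likelihood contribution of one voter's ranking."""
    proportions = gating_proportions(covariates, coefficients)
    return sum(
        pi * benter_probability(ranking, support, dampening)
        for pi, support in zip(proportions, supports)
    )


# One hypothetical voter: intercept plus age covariate, two voting blocs.
example = mixture_probability(
    ranking=[2, 0],
    covariates=[1.0, 0.3],
    coefficients=[[0.0, 0.0], [0.5, -1.2]],
    supports=[[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]],
    dampening=[1.0, 0.6, 0.0],
)
```

A dampening value below one flattens the choice probabilities at lower preference levels, reflecting less certainty in later choices.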
Inferring food intake from multiple biomarkers using a latent variable model
Metabolomic based approaches have gained much attention in recent years due
to their promising potential to deliver objective tools for assessment of food
intake. In particular, multiple biomarkers have emerged for single foods.
However, there is a lack of statistical tools available for combining multiple
biomarkers to infer food intake. Furthermore, there is a paucity of approaches
for estimating the uncertainty around biomarker based prediction of intake.
Here, to facilitate inference on the relationship between multiple
metabolomic biomarkers and food intake in an intervention study conducted under
the A-DIET research programme, a latent variable model, multiMarker, is
proposed. The proposed model draws on factor analytic and mixture of experts
models, describing intake as a continuous latent variable whose value gives
rise to the observed biomarker values. We employ a mixture of Gaussian
distributions to flexibly model the latent variable. A Bayesian hierarchical
modelling framework provides flexibility to adapt to different biomarker
distributions and facilitates prediction of the latent intake along with its
associated uncertainty.
Simulation studies are conducted to assess the performance of the proposed
multiMarker framework, prior to its use in the motivating application
of quantifying apple intake.
Probabilistic principal component analysis for metabolomic data
Background:
Data from metabolomic studies are typically complex and high-dimensional. Principal component analysis (PCA) is currently the most widely used statistical technique for analyzing metabolomic data. However, PCA is limited by the fact that it is not based on a statistical model.
Results:
Here, probabilistic principal component analysis (PPCA), which addresses some of the limitations of PCA, is reviewed and extended. A novel extension of PPCA, called probabilistic principal component and covariates analysis (PPCCA), is introduced, which provides a flexible approach to jointly modelling metabolomic data and additional covariate information. The use of a mixture of PPCA models for discovering the number of inherent groups in metabolomic data is demonstrated. The jackknife technique is employed to construct confidence intervals for estimated model parameters throughout. The optimal number of principal components is determined using the Bayesian Information Criterion, which is modified to address the high dimensionality of the data.
Conclusions:
The methods presented are illustrated through an application to metabolomic data sets. Jointly modeling metabolomic data and covariates was successfully achieved and has the potential to provide deeper insight into the underlying data structure. Examination of confidence intervals for the model parameters, such as loadings, allows for principled and clear interpretation of the underlying data structure. A software package called MetabolAnalyze, freely available through the R statistical software, has been developed to facilitate implementation of the presented methods in the metabolomics field.
Irish Research Council for Science, Engineering and Technology; Health Research Board
A Latent Shrinkage Position Model for Binary and Count Network Data
Interactions between actors are frequently represented using a network. The
latent position model is widely used for analysing network data, whereby each
actor is positioned in a latent space. Inferring the dimension of this space is
challenging. Often, for simplicity, two dimensions are used or model selection
criteria are employed to select the dimension, but this requires choosing a
criterion and the computational expense of fitting multiple models. Here the
latent shrinkage position model (LSPM) is proposed which intrinsically infers
the effective dimension of the latent space. The LSPM employs a Bayesian
nonparametric multiplicative truncated gamma process prior that ensures
shrinkage of the variance of the latent positions across higher dimensions.
Dimensions with non-negligible variance are deemed most useful to describe the
observed network, inducing automatic inference on the latent space dimension.
While the LSPM is applicable to many network types, logistic and Poisson LSPMs
are developed here for binary and count networks respectively. Inference
proceeds via a Markov chain Monte Carlo algorithm, where novel surrogate
proposal distributions reduce the computational burden. The LSPM's properties
are assessed through simulation studies, and its utility is illustrated through
application to real network datasets. Open source software assists wider
implementation of the LSPM.
Comment: 75 pages, 47 figures
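The shrinkage mechanism described in the abstract above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the authors' software: the precision of latent dimension l is taken to be the cumulative product of multipliers delta_1, ..., delta_l, with delta_1 drawn from a gamma distribution and later multipliers drawn from a gamma distribution truncated to [1, infinity), so variances cannot grow with dimension. The logistic link based on squared Euclidean distance, the hyperparameter values and all function names are hypothetical.

```python
import math
import random


def sample_dimension_variances(n_dims, a1=2.0, a2=3.0, rng=None):
    """Draw latent-position variances from a truncated multiplicative gamma process."""
    rng = rng or random.Random(0)
    deltas = [rng.gammavariate(a1, 1.0)]
    while len(deltas) < n_dims:
        draw = rng.gammavariate(a2, 1.0)
        if draw >= 1.0:  # rejection step enforces truncation to [1, inf)
            deltas.append(draw)
    variances = []
    precision = 1.0
    for delta in deltas:
        precision *= delta  # precision of dimension l is delta_1 * ... * delta_l
        variances.append(1.0 / precision)
    return variances


def link_probability(alpha, z_i, z_j):
    """Logistic edge probability that decreases with squared latent distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(z_i, z_j))
    return 1.0 / (1.0 + math.exp(-(alpha - sq_dist)))


variances = sample_dimension_variances(5)
rng = random.Random(1)
# Positions for two actors: each coordinate uses that dimension's variance.
z_a = [rng.gauss(0.0, math.sqrt(v)) for v in variances]
z_b = [rng.gauss(0.0, math.sqrt(v)) for v in variances]
p_edge = link_probability(0.5, z_a, z_b)
```

Because later dimensions receive ever smaller variances, their coordinates contribute little to the distances between actors, and the dimensions with non-negligible variance indicate the effective dimension.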
Mixed membership models for rank data: Investigating structure in Irish voting data
A mixed membership model is an individual-level mixture model in which individuals have partial membership of the profiles (or groups) that characterize a population. A mixed membership model for rank data is outlined and illustrated through the analysis of voting in the 2002 Irish general election. This election used a voting system called proportional representation using a single transferable vote (PR-STV), in which voters rank some or all of the candidates in order of preference. The data set considered consists of all votes in a constituency from the 2002 Irish general election. Interest lies in highlighting distinct voting profiles within the electorate and studying how voters affiliate themselves to these voting profiles. The mixed membership model for rank data is fitted to the voting data and is shown to give a concise and highly interpretable explanation of voting patterns in this election.
Clustering South African households based on their asset status using latent variable models
The Agincourt Health and Demographic Surveillance System has since 2001
conducted a biannual household asset survey in order to quantify household
socio-economic status (SES) in a rural population living in northeast South
Africa. The survey contains binary, ordinal and nominal items. In the absence
of income or expenditure data, the SES landscape in the study population is
explored and described by clustering the households into homogeneous groups
based on their asset status. A model-based approach to clustering the Agincourt
households, based on latent variable models, is proposed. In the case of
modeling binary or ordinal items, item response theory models are employed. For
nominal survey items, a factor analysis model, similar in nature to a
multinomial probit model, is used. Both model types have an underlying latent
variable structure - this similarity is exploited and the models are combined
to produce a hybrid model capable of handling mixed data types. Further, a
mixture of the hybrid models is considered to provide clustering capabilities
within the context of mixed binary, ordinal and nominal response data. The
proposed model is termed a mixture of factor analyzers for mixed data (MFA-MD).
The MFA-MD model is applied to the survey data to cluster the Agincourt
households into homogeneous groups. The model is estimated within the Bayesian
paradigm, using a Markov chain Monte Carlo algorithm. Intuitive groupings
result, providing insight into the different socio-economic strata within the
Agincourt region.
Comment: Published at http://dx.doi.org/10.1214/14-AOAS726 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
Selecting Milk Spectra to Develop Equations to Predict Milk Technological Traits
Peer-reviewed. Including all available data when developing equations to relate mid-infrared spectra to a phenotype may be suboptimal for poorly represented spectra. Here, an alternative local changepoint approach was developed to predict six milk technological traits from mid-infrared spectra. Neighbours were objectively identified for each predictand as those most similar to the predictand using the Mahalanobis distances between the spectral principal components, and subsequently used in partial least squares regression (PLSR) analyses. The performance of the local changepoint approach was compared to that of PLSR using all spectra (global PLSR) and another LOCAL approach, whereby a fixed number of neighbours was used in the prediction according to the correlation between the predictand and the available spectra. Global PLSR had the lowest RMSEV for five traits. The local changepoint approach had the lowest RMSEV for one trait; however, it outperformed the LOCAL approach for four traits. When the 5% of the spectra with the greatest Mahalanobis distance from the centre of the global principal component space were analysed, the local changepoint approach outperformed the global PLSR and the LOCAL approach in two and five traits, respectively. The objective selection of neighbours improved the prediction performance compared to utilising a fixed number of neighbours; however, it generally did not outperform the global PLSR.
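The neighbour search described in the abstract above can be illustrated with a small sketch. It assumes that spectra have already been projected onto principal components, whose scores are uncorrelated, so the Mahalanobis distance reduces to a Euclidean distance with each component scaled by its variance. The fixed neighbour count k is a simplification for illustration only; the changepoint rule that the abstract uses to choose the number of neighbours is not reproduced here. All names and values are hypothetical.

```python
import math


def mahalanobis_in_pc_space(scores_a, scores_b, pc_variances):
    """Mahalanobis distance between two spectra given their PC scores.

    PC scores are uncorrelated, so the covariance matrix is diagonal and
    each squared difference is simply divided by that component's variance.
    """
    return math.sqrt(
        sum((a - b) ** 2 / v for a, b, v in zip(scores_a, scores_b, pc_variances))
    )


def nearest_neighbours(target, library, pc_variances, k):
    """Indices of the k library spectra closest to the target spectrum."""
    ranked = sorted(
        range(len(library)),
        key=lambda i: mahalanobis_in_pc_space(target, library[i], pc_variances),
    )
    return ranked[:k]


# Hypothetical two-component example: the first PC has four times the variance.
pc_variances = [4.0, 1.0]
library = [[2.0, 0.0], [0.0, 1.5], [0.5, 0.5]]
neighbours = nearest_neighbours([0.0, 0.0], library, pc_variances, k=2)
```

In this example the spectrum at [2.0, 0.0] is selected ahead of the one at [0.0, 1.5], even though it is further away in plain Euclidean terms, because a difference along a high-variance component counts for less.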